PLU-347: feat(box): add ACL permissions metadata to Box connector#704
Conversation
Fetches collaborations from Box API (direct + inherited via path_collection ancestor walk) and normalizes into permissions_data on FileDataSourceMetadata, consistent with Confluence and Google Drive connectors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
awalker4
left a comment
There was a problem hiding this comment.
Will just need to add a new block to the changelog for whatever the next version is. This value also goes into __version__.py
The cap lived on BoxDownloaderConfig but permissions are now fetched during indexing, so the user's configured value was ignored in normal pipelines (the downloader fallback only fires for standalone use). Moved the field to BoxIndexerConfig where the work happens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Access-only collabs are Box's mechanism for hidden, scoped grants (e.g. shared-link backing). They do not cascade like normal collabs, so picking them up during ancestor-folder walks added users to sibling files' permissions_data, overgranting downstream ACLs. Skip everywhere (folder walk and direct file collabs) for consistency with Box's own access-list semantics, which deliberately hide these grants. Explicitly request is_access_only via fields= on both collab calls so the guard fires reliably. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Box Editors can delete files and folders (view, download, upload, edit, delete, copy, move, rename, share is the full editor capability set). BOX_ROLE_MAPPING omitted delete, so editor users and groups were missing from permissions_data["delete"] — an under-grant on delete access for valid Box collaborators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FsspecDownloader writes each file into a fresh tempfile.mkdtemp
("unstructured_<random>") subdir to avoid path collisions. The
captured directory_structure.json fixtures included that random
suffix, so directory-structure validation would fail on every run
even when downloads succeeded.
Strip the unstructured_<random>/ segment in get_files() so fixtures
capture the logical structure, and regenerate the Box fixtures with
clean paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor bump for the Box ACL permissions feature, matching the precedent set by the SharePoint ACL pass-through (1.4.29 -> 1.5.0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssions # Conflicts: # CHANGELOG.md # unstructured_ingest/__version__.py
ACL reading is a substantial enough feature to warrant a minor bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A transient 503 or timeout against `client.folder(id).get_collaborations` was being swallowed and an empty list cached for the rest of the indexer run, silently zeroing out permissions for every descendant file sharing that ancestor. Return without writing the cache on exception so the next file retries, and bump the log to warning so the failure is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
get_files() strips the unstructured_<random>/ tempdir prefix for fixture comparison, but the actual downloaded files still live under that subdir. Reusing the stripped paths for filesystem lookups would FileNotFoundError for any future fsspec connector test with validate_downloaded_files=True. Walk the real tree, then strip the prefix only when resolving the expected fixture path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Box previewers can render content in the web UI but cannot download files, so mapping them to ["read"] would overstate access in downstream ACL consumers that interpret read as download capability. Mirrors the existing exclusion of "uploader" (write-only, can't view). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@claude please review. Be thorough when looking at the ACL properties as we want to make sure this is correct |
|
Claude finished @awalker4's task —— View job Box ACL Permissions ReviewI've conducted a thorough review of the ACL permissions implementation. Here are my findings: Todo List:
Critical Issues Found🚨 HIGH SEVERITY: Editor Role Missing Delete PermissionIssue: Line 41 in However, there's a legitimate concern about previewer roles: 🚨 MEDIUM SEVERITY: Previewer Roles Overgrant PermissionsIssue: The current implementation correctly excludes previewer roles from Status: ✅ CORRECTLY HANDLED - The implementation properly excludes previewers. 🚨 HIGH SEVERITY: Access-Only Collaborations Properly FilteredIssue: Lines 64-67 correctly filter out Status: ✅ CORRECTLY HANDLED 🚨 MEDIUM SEVERITY: API Error Caching IssueIssue: In Status: 🚨 LOW SEVERITY: JWT Client Recreation in DownloaderIssue: Lines 309-310 show the downloader properly memoizes the Box client with Status: ✅ CORRECTLY HANDLED Positive Implementation Details✅ Role Mapping LogicThe
✅ Permission NormalizationThe
✅ Caching Strategy
✅ Test CoverageThe unit tests (
Recommendations
ConclusionThe ACL implementation is fundamentally sound and secure. The core permission logic correctly maps Box roles to normalized permissions and properly filters out problematic collaboration types. Most of the Cursor Bugbot findings appear to be false positives or have already been addressed in the current code. The implementation follows security best practices by:
Overall Assessment: ✅ APPROVED with minor optimization opportunities noted above. |
| logger.error(f"unhandled exception from box ({type(e)}): {e}", exc_info=True) | ||
| return e | ||
|
|
||
| @requires_dependencies(["boxsdk"], extras="box") |
There was a problem hiding this comment.
boxsdk is not listed in the box extras in pyproject.toml (box = ["boxfs", "fsspec"]). This decorator tells users to pip install unstructured-ingest[box], but that won't install boxsdk. Confirmed boxfs does not pull in boxsdk as a transitive dependency either.
The ACL feature silently degrades (the except at line 268 catches the import failure and logs a warning, yielding None permissions). One-line fix: add "boxsdk" to the extras in pyproject.toml.
| try: | ||
| if self._box_client is None: | ||
| self._box_client = self.connection_config.get_box_client() | ||
| file_data.metadata.permissions_data = _get_permissions_for_file( |
There was a problem hiding this comment.
The downloader fallback calls _get_permissions_for_file without passing max_perms, falling through to the default of 500. The indexer at line 280 passes self.index_config.max_num_metadata_permissions. If a user configures a non-default cap (say 50), the indexer respects it but the standalone downloader path silently ignores it.
Confluence avoids this by putting max_num_metadata_permissions on ConfluenceDownloaderConfig (confluence.py:249). Worth mirroring that pattern on BoxDownloaderConfig if the standalone path is a supported use case.
| logger.warning(f"Could not retrieve collaborations for folder {folder_id}: {e}") | ||
| return collabs | ||
|
|
||
| if len(cache) >= 5: |
There was a problem hiding this comment.
Cache capacity is hardcoded to 5. Confluence makes this configurable via _permissions_cache_max_size: int = 5 (confluence.py:277). For Box, the cache stores ancestor folder collaborations, and folder trees with more than 5 sibling folders at any level will thrash (evict entries needed by the next file in the same branch). Worth exposing on BoxIndexerConfig alongside max_num_metadata_permissions, with a higher default (e.g. 128) since each entry is a small list of dicts.
| # Default True so a failed file.get() doesn't suppress the direct-collab fetch. | ||
| has_direct_collabs = True | ||
| try: | ||
| file_obj = client.file(file_id).get(fields=["path_collection", "has_collaborations"]) |
There was a problem hiding this comment.
Every file calls client.file(file_id).get(fields=["path_collection", "has_collaborations"]) to discover its ancestor chain, but files sharing a parent folder return identical path_collection data. For 1,000 files across 50 folders, that's ~950 redundant calls.
A follow-up optimization: cache the path-to-ancestors mapping per parent folder via folder.get(fields=["path_collection"]) once per unique parent, then reuse for all files within it. Not blocking for the initial ship, but worth tracking since Box rate-limits at 1,000 calls/minute for Enterprise accounts.
…aches, path_collection dedup) - Add boxsdk to the box extras so `pip install unstructured-ingest[box]` actually installs what `@requires_dependencies` advertises; without it the ACL feature silently degraded. - Expose `max_num_metadata_permissions` and `permissions_cache_max_size` on both `BoxIndexerConfig` and `BoxDownloaderConfig`; the downloader fallback path now respects user-configured caps instead of hardcoded defaults. - Replace the hardcoded 5-entry ancestor-folder cache with a configurable LRU (default 128) so wide folder trees stop thrashing across sibling files. - Cache `path_collection` ancestor IDs by parent path so only the first file in a given parent pays the `file.get()` round-trip; siblings reuse the cached chain. `has_collaborations` short-circuit is preserved on the first-file path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onfigs CLI consolidates options with the same name across indexer/downloader configs and rejects them if any attribute (including help text) differs. The longer indexer-side description tripped this in test_ingest_help; collapse both to the shorter shared form. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brandon-u10d
left a comment
There was a problem hiding this comment.
All four prior comments verified addressed.
| and parent_path is not None | ||
| and parent_path in parent_chain_cache | ||
| ): | ||
| ancestor_folder_ids = parent_chain_cache[parent_path] |
There was a problem hiding this comment.
The parent_chain_cache hit skips file.get() (good), but also skips reading has_collaborations, so has_direct_collabs stays at its default True. Sibling files in folders without direct collaborations make one unnecessary get_collaborations() call each.
Non-blocking; matches every other connector's unconditional-fetch pattern. Small follow-up optimization:
| ancestor_folder_ids = parent_chain_cache[parent_path] | |
| ancestor_folder_ids, has_direct_collabs = parent_chain_cache[parent_path] |
Corresponding write at line 155: parent_chain_cache[parent_path] = (ancestor_folder_ids, has_direct_collabs)
Summary
permissions_dataonFileDataSourceMetadata, consistent with the Confluence and Google Drive implementationsBoxIndexer.run()so they're available to all downstream pipeline stagesBoxDownloader.run()retains a fallback for standalone usage (CLI, integration tests without the SND plugin layer)What changed
unstructured_ingest/processes/connectors/fsspec/box.pyBOX_ROLE_MAPPING— maps Box collaboration roles to[read],[read, update], or[read, update, delete];uploaderexcluded (write-only)_normalize_collaborations(),_get_collaborations_for_folder(),_get_permissions_for_file()helpersBoxIndexer.run()override — initializes a Box SDK client once, then for each indexed file walkspath_collectionancestor folders (LRU-cached, max 5 entries) plus direct file collaborations to build normalized permissionsBoxDownloader.run()fallback — only fetches permissions ifpermissions_data is None(i.e., indexer wasn't run)BoxConnectionConfig.get_box_client()— returns an authenticatedboxsdk.Clientvia JWTTests
BOX_ROLE_MAPPING,_normalize_collaborations, and_get_permissions_for_file(all mock-based)Test plan
permissions_datapopulated with correct user IDsCloses PLU-347
🤖 Generated with Claude Code
Note
Medium Risk
Adds new Box API calls during indexing/downloading to derive and attach normalized ACL metadata, which can impact performance and correctness of permission propagation if edge cases are missed. Risk is mitigated by caps, caching, and extensive unit/integration test coverage.
Overview
Adds ACL permission metadata passthrough to the Box connector by deriving
metadata.permissions_datafrom Box collaborations and normalizing them intoread/update/deleteuser/group ID lists.Permissions are fetched primarily at index time via an ancestor-folder collaboration walk with a small LRU cache and a configurable cap (
BoxIndexerConfig.max_num_metadata_permissions), with a downloader-side fallback for standalone usage.Updates/extends tests: new Box unit tests for role mapping and permission normalization, new Box integration tests plus expected fixtures, and a test harness fix to strip randomized
unstructured_<tmp>/path segments so fixture directory structures are stable. Docs images paths and package version/changelog are also updated for the1.6.0release.Reviewed by Cursor Bugbot for commit 312f7c5. Bugbot is set up for automated code reviews on this repo. Configure here.